Tutorial Brief

In this tutorial we will cover the basics of using Google Custom Search to search the Internet.

Links:

Video Tutorial:

http://youtu.be/ewPJs4N8d8M


In [1]:
import json
import requests
import pandas as pd

Understanding Google Custom Search

Google Custom Search replaces the deprecated Google Search API. It is designed to search one or more websites and to be embedded within a website.

There is still an option to search the entire web. With this option enabled and no specific website configured, the results are very close to what you get when you search Google directly. The remaining differences are due to the personalization and localization that Google applies to its own search results.


In [2]:
key = ""  # your Google API key
cx = ""   # your custom search engine ID
Parameter name Value Description
Required parameters
q string The search expression.
Optional parameters
c2coff string Enables or disables Simplified and Traditional Chinese Search
  • The default value for this parameter is 0 (zero), meaning that the feature is enabled. Supported values are:
    • 1: Disabled
    • 0: Enabled (default)
cr string Restricts search results to documents originating in a particular country.
  • You may use Boolean operators in the cr parameter's value.
  • Google Search determines the country of a document by analyzing:
    • the top-level domain (TLD) of the document's URL
    • the geographic location of the Web server's IP address
  • See the Country Parameter Values page for a list of valid values for this parameter.
cref string The URL of a linked custom search engine specification to use for this request. 
  • Does not apply for Google Site Search
  • If both cx and cref are specified, the cx value is used
cx string The custom search engine ID to use for this request.
  • If both cx and cref are specified, the cx value is used.
dateRestrict string Restricts results to URLs based on date. Supported values include:
  • d[number]: requests results from the specified number of past days.
  • w[number]: requests results from the specified number of past weeks.
  • m[number]: requests results from the specified number of past months.
  • y[number]: requests results from the specified number of past years.
exactTerms string Identifies a phrase that all documents in the search results must contain.
excludeTerms string Identifies a word or phrase that should not appear in any documents in the search results.
fileType string Restricts results to files of a specified extension. A list of file types indexable by Google can be found in Webmaster Tools Help Center.
filter string Controls turning on or off the duplicate content filter.
  • See Automatic Filtering for more information about Google's search results filters. Note that host crowding filtering applies only to multi-site searches.
  • By default, Google applies filtering to all search results to improve the quality of those results.


Acceptable values are:
  • "0": Turns off duplicate content filter.
  • "1": Turns on duplicate content filter.
gl string Geolocation of end user. 
  • The gl parameter value is a two-letter country code. The gl parameter boosts search results whose country of origin matches the parameter value. See the Country Codes page for a list of valid values.
  • Specifying a gl parameter value should lead to more relevant results. This is particularly true for international customers and, even more specifically, for customers in English-speaking countries other than the United States.
googlehost string The local Google domain (for example, google.com, google.de, or google.fr) to use to perform the search. 
highRange string
  • Specifies the ending value for a search range.
  • Use lowRange and highRange to append an inclusive search range of lowRange...highRange to the query.
hl string Sets the user interface language. 
hq string Appends the specified query terms to the query, as if they were combined with a logical AND operator.
imgColorType string Returns black and white, grayscale, or color images: mono, gray, and color.

Acceptable values are:
  • "color": color
  • "gray": gray
  • "mono": mono
imgDominantColor string Returns images of a specific dominant color.

Acceptable values are:
  • "black": black
  • "blue": blue
  • "brown": brown
  • "gray": gray
  • "green": green
  • "pink": pink
  • "purple": purple
  • "teal": teal
  • "white": white
  • "yellow": yellow
imgSize string Returns images of a specified size.

Acceptable values are:
  • "huge": huge
  • "icon": icon
  • "large": large
  • "medium": medium
  • "small": small
  • "xlarge": xlarge
  • "xxlarge": xxlarge
imgType string Returns images of a specified type.

Acceptable values are:
  • "clipart": clipart
  • "face": face
  • "lineart": lineart
  • "news": news
  • "photo": photo
linkSite string Specifies that all search results should contain a link to a particular URL.
lowRange string Specifies the starting value for a search range.
  • Use lowRange and highRange to append an inclusive search range of lowRange...highRange to the query.
lr string Restricts the search to documents written in a particular language (e.g., lr=lang_ja).

Acceptable values are:
  • "lang_ar": Arabic
  • "lang_bg": Bulgarian
  • "lang_ca": Catalan
  • "lang_cs": Czech
  • "lang_da": Danish
  • "lang_de": German
  • "lang_el": Greek
  • "lang_en": English
  • "lang_es": Spanish
  • "lang_et": Estonian
  • "lang_fi": Finnish
  • "lang_fr": French
  • "lang_hr": Croatian
  • "lang_hu": Hungarian
  • "lang_id": Indonesian
  • "lang_is": Icelandic
  • "lang_it": Italian
  • "lang_iw": Hebrew
  • "lang_ja": Japanese
  • "lang_ko": Korean
  • "lang_lt": Lithuanian
  • "lang_lv": Latvian
  • "lang_nl": Dutch
  • "lang_no": Norwegian
  • "lang_pl": Polish
  • "lang_pt": Portuguese
  • "lang_ro": Romanian
  • "lang_ru": Russian
  • "lang_sk": Slovak
  • "lang_sl": Slovenian
  • "lang_sr": Serbian
  • "lang_sv": Swedish
  • "lang_tr": Turkish
  • "lang_zh-CN": Chinese (Simplified)
  • "lang_zh-TW": Chinese (Traditional)
num unsigned integer Number of search results to return.
  • Valid values are integers between 1 and 10, inclusive.
orTerms string Provides additional search terms to check for in a document, where each document in the search results must contain at least one of the additional search terms.
relatedSite string Specifies that all search results should be pages that are related to the specified URL.
rights string Filters based on licensing. Supported values include: cc_publicdomain, cc_attribute, cc_sharealike, cc_noncommercial, cc_nonderived, and combinations of these.
safe string Search safety level.

Acceptable values are:
  • "high": Enables highest level of SafeSearch filtering.
  • "medium": Enables moderate SafeSearch filtering.
  • "off": Disables SafeSearch filtering. (default)
searchType string Specifies the search type: image. If unspecified, results are limited to webpages.

Acceptable values are:
  • "image": custom image search.
siteSearch string Specifies that all search results should be pages from a given site.
siteSearchFilter string Controls whether to include or exclude results from the site named in the siteSearch parameter.

Acceptable values are:
  • "e": exclude
  • "i": include
sort string The sort expression to apply to the results.
start unsigned integer The index of the first result to return.
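
To make the table above more concrete, here is a minimal sketch of how a few of the optional parameters could be combined into one request dictionary. The helper name `build_parameters` and the placeholder credentials are assumptions made for illustration; only the parameter names and values come from the table.

```python
def build_parameters(query, key, cx, **optional):
    """Return a parameter dict for a Custom Search request.

    Optional parameters set to None are left out, so the API
    falls back to its default for them.
    """
    parameters = {"q": query, "key": key, "cx": cx}
    for name, value in optional.items():
        if value is not None:
            parameters[name] = value
    return parameters


# Placeholder credentials; replace them with your own key and engine ID.
example = build_parameters(
    "halloween",
    key="YOUR_API_KEY",
    cx="YOUR_ENGINE_ID",
    num=5,              # between 1 and 10, inclusive
    dateRestrict="w2",  # results from the past two weeks
    lr="lang_en",       # documents written in English
    safe="medium",      # moderate SafeSearch filtering
    siteSearch=None,    # omitted because it is None
)
print(sorted(example))
```

The resulting dictionary can be passed as `params` in the same way as the `parameters` dictionary used in the rest of this tutorial.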

Prepare the request


In [3]:
url = "https://www.googleapis.com/customsearch/v1"
parameters = {"q": "halloween",
              "cx": cx,
              "key": key,
              }
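
It can help to see what URL such a parameter dictionary corresponds to. The sketch below builds the query string by hand with the standard library (Python 3); the key and cx values are placeholders, not real credentials. The requests library performs an equivalent encoding itself when it receives `params`, so this step is for inspection only.

```python
from urllib.parse import urlencode

url = "https://www.googleapis.com/customsearch/v1"
# Placeholder values; the real key and cx come from your Google account.
parameters = {"q": "halloween", "cx": "YOUR_ENGINE_ID", "key": "YOUR_API_KEY"}

# Join the endpoint and the URL-encoded parameters into one address.
full_url = url + "?" + urlencode(parameters)
print(full_url)
```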

Make the request


In [4]:
# Send a GET request; requests URL-encodes the parameters into the query string.
page = requests.get(url, params=parameters)

Process Results


In [5]:
# Parse the JSON response body into a Python dictionary.
results = json.loads(page.text)
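
The parsing step assumes the request succeeded. A minimal sketch of a more defensive version follows. It assumes that a failed request returns a JSON body with a top-level "error" object in place of "items"; the function name `parse_search_response` and both sample bodies are made up for illustration and are not real API output.

```python
import json


def parse_search_response(text):
    """Decode a response body and raise if it describes an error."""
    payload = json.loads(text)
    # Assumption: failed requests carry an "error" object instead of "items".
    if "error" in payload:
        error = payload["error"]
        raise RuntimeError("Search failed: %s" % error.get("message", error))
    return payload


# Hand-written sample bodies for illustration only.
ok_body = '{"kind": "customsearch#search", "items": []}'
bad_body = '{"error": {"code": 403, "message": "quota exceeded"}}'

print(parse_search_response(ok_body)["kind"])
try:
    parse_search_response(bad_body)
except RuntimeError as exc:
    print(exc)
```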

Inspecting Results


In [6]:
results.keys()


Out[6]:
[u'kind', u'url', u'items', u'context', u'queries', u'searchInformation']

Inspecting Search Meta Data


In [7]:
results["kind"]


Out[7]:
u'customsearch#search'

In [8]:
results["url"]


Out[8]:
{u'template': u'https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json',
 u'type': u'application/json'}

In [9]:
len(results["items"])


Out[9]:
10

In [10]:
results["queries"]


Out[10]:
{u'nextPage': [{u'count': 10,
   u'cx': u'013297447421619698040:gnhhn4xeyns',
   u'inputEncoding': u'utf8',
   u'outputEncoding': u'utf8',
   u'safe': u'off',
   u'searchTerms': u'halloween',
   u'startIndex': 11,
   u'title': u'Google Custom Search - halloween',
   u'totalResults': u'117000000'}],
 u'request': [{u'count': 10,
   u'cx': u'013297447421619698040:gnhhn4xeyns',
   u'inputEncoding': u'utf8',
   u'outputEncoding': u'utf8',
   u'safe': u'off',
   u'searchTerms': u'halloween',
   u'startIndex': 1,
   u'title': u'Google Custom Search - halloween',
   u'totalResults': u'117000000'}]}

In [11]:
results["searchInformation"]


Out[11]:
{u'formattedSearchTime': u'0.53',
 u'formattedTotalResults': u'117,000,000',
 u'searchTime': 0.531343,
 u'totalResults': u'117000000'}
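
Note that totalResults is returned as a string, while searchTime is a number. A short sketch of converting the count before doing arithmetic or comparisons with it, using the values from the output above:

```python
# Values copied from the searchInformation output above.
search_information = {
    "formattedSearchTime": "0.53",
    "formattedTotalResults": "117,000,000",
    "searchTime": 0.531343,
    "totalResults": "117000000",
}

# Convert the string to an integer so it can be compared numerically.
total_results = int(search_information["totalResults"])
print(total_results)
print(total_results > 1000000)
```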

Inspecting a single result


In [12]:
results["items"][0]


Out[12]:
{u'cacheId': u'0b5Oki2-f4EJ',
 u'displayLink': u'en.wikipedia.org',
 u'formattedUrl': u'en.wikipedia.org/wiki/Halloween',
 u'htmlFormattedUrl': u'en.wikipedia.org/wiki/<b>Halloween</b>',
 u'htmlSnippet': u'<b>Halloween</b> or Hallowe&#39;en <sup>6]</sup> also known as Allhalloween, All Hallows&#39; Eve, or All <br>\nSaints&#39; Eve, is a yearly celebration observed in a number of countries on 31&nbsp;...',
 u'htmlTitle': u'<b>Halloween</b> - Wikipedia, the free encyclopedia',
 u'kind': u'customsearch#result',
 u'link': u'http://en.wikipedia.org/wiki/Halloween',
 u'pagemap': {u'cse_image': [{u'src': u"http://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Jack-o'-Lantern_2003-10-31.jpg/240px-Jack-o'-Lantern_2003-10-31.jpg"}],
  u'cse_thumbnail': [{u'height': u'188',
    u'src': u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQDphQXJ1JeGZS30rhMAfcAhklES25AKZ-9TS9YwUdycnIqLa7W96Maxjg',
    u'width': u'192'}],
  u'event': [{u'dtstart': u'2014-10-31',
    u'summary': u'First day of Allhallowtide'}],
  u'hcalendar': [{u'dtstart': u'2014-10-31',
    u'summary': u'First day of Allhallowtide'}]},
 u'snippet': u"Halloween or Hallowe'en 6] also known as Allhalloween, All Hallows' Eve, or All \nSaints' Eve, is a yearly celebration observed in a number of countries on 31\xa0...",
 u'title': u'Halloween - Wikipedia, the free encyclopedia'}
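
The pagemap entry is nested several levels deep. The sketch below shows one way to read the thumbnail URL from a result; the helper name `first_thumbnail` is hypothetical. It uses `.get` with defaults on the assumption that some results may lack a pagemap or a thumbnail, which the single result above cannot confirm either way.

```python
def first_thumbnail(item):
    """Return the first thumbnail URL of a result, or None if absent."""
    pagemap = item.get("pagemap", {})
    thumbnails = pagemap.get("cse_thumbnail", [])
    if thumbnails:
        return thumbnails[0].get("src")
    return None


# Trimmed copy of the result shown above.
item = {
    "title": "Halloween - Wikipedia, the free encyclopedia",
    "link": "http://en.wikipedia.org/wiki/Halloween",
    "pagemap": {
        "cse_thumbnail": [
            {
                "height": "188",
                "src": "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQDphQXJ1JeGZS30rhMAfcAhklES25AKZ-9TS9YwUdycnIqLa7W96Maxjg",
                "width": "192",
            }
        ]
    },
}

print(first_thumbnail(item))
print(first_thumbnail({"title": "a result without a pagemap"}))
```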

Process Results Into a Pandas Data Frame


In [13]:
def process_search(results):
    """Return a DataFrame with the link, title, and snippet of each result."""
    items = results["items"]
    df = pd.DataFrame([item["link"] for item in items], columns=["link"])
    df["title"] = [item["title"] for item in items]
    df["snippet"] = [item["snippet"] for item in items]
    return df

# Convert the first page of results and display it.
df = process_search(results)
df


Out[13]:
link title snippet
0 http://en.wikipedia.org/wiki/Halloween Halloween - Wikipedia, the free encyclopedia Halloween or Hallowe'en 6] also known as Allha...
1 http://www.history.com/topics/halloween Halloween - Videos, Facts, Origin & Meaning - ... Find out more about the history of Halloween, ...
2 http://www.imdb.com/title/tt0077651/ Halloween (1978) - IMDb Halloween II. A Nightmare on Elm Street. Hallo...
3 http://www.halloween.com/ Halloween 2014 | Halloween.com Halloween fun on the internet, the one source ...
4 http://halloweenmovies.com/ HalloweenMovies™ Tickets now on sale for Halloween - ODEON Cine...
5 http://www.loc.gov/folklife/halloween.html Halloween: The Fantasy and Folklore of All Hal... Beginnings of Halloween celebration; First ori...
6 http://www.spirithalloween.com/ Halloween Costumes - Childrens & Adult Hallowe... Spirit Halloween - Halloween Stores nationwide...
7 http://www.cdc.gov/family/halloween/ CDC - Halloween Health and Safety - Family Health 5 days ago ... Fall celebrations like Hallowee...
8 http://www.partycity.com/category/halloween+co... Halloween Costumes for Kids & Adults - Costume... Halloween costumes for all ages and sizes. Sho...
9 http://www.overkillsoftware.com/halloween/ PAYDAY HALLOWEEN SPECIAL 2014
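
process_search indexes every item with item["snippet"], which raises a KeyError if a field is missing. As a hedged alternative, the sketch below flattens the results into plain row dictionaries and substitutes an empty string for absent fields. The function name `results_to_rows` and the sample data are assumptions for illustration: the sample is hand-made, with a shortened snippet and a second item that deliberately has none.

```python
def results_to_rows(results, fields=("link", "title", "snippet")):
    """Flatten search results into a list of row dictionaries."""
    rows = []
    # Assumption: a response without matches may have no "items" key.
    for item in results.get("items", []):
        rows.append({field: item.get(field, "") for field in fields})
    return rows


# Hand-made sample; the second item has no snippet on purpose.
sample = {
    "items": [
        {
            "link": "http://en.wikipedia.org/wiki/Halloween",
            "title": "Halloween - Wikipedia, the free encyclopedia",
            "snippet": "Halloween or Hallowe'en ...",
        },
        {
            "link": "http://www.overkillsoftware.com/halloween/",
            "title": "PAYDAY HALLOWEEN SPECIAL 2014",
        },
    ]
}

rows = results_to_rows(sample)
for row in rows:
    print(row)
```

The list of rows can then be handed to pandas, for example with pd.DataFrame(rows, columns=["link", "title", "snippet"]).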

Getting Results from more pages

Use the "start" parameter to skip the results from previous pages. The "start" index of the next page is available in the response under "queries.nextPage[0].startIndex".


In [14]:
# Read the start index and search terms for the next page from the previous response.
next_index = results["queries"]["nextPage"][0]["startIndex"]
search_terms = results["queries"]["nextPage"][0]["searchTerms"]

url = "https://www.googleapis.com/customsearch/v1"
parameters = {"q": search_terms,
              "cx": cx,
              "key": key,
              "start": next_index
              }

In [15]:
page = requests.request("GET", url, params=parameters)
results = json.loads(page.text)

In [16]:
# process_search was defined in cell In [13]; reuse it for the second page.
temp_df = process_search(results)
# Append the new rows and renumber the index from 0.
df = pd.concat([df, temp_df], ignore_index=True)
df


Out[16]:
link title snippet
0 http://en.wikipedia.org/wiki/Halloween Halloween - Wikipedia, the free encyclopedia Halloween or Hallowe'en 6] also known as Allha...
1 http://www.history.com/topics/halloween Halloween - Videos, Facts, Origin & Meaning - ... Find out more about the history of Halloween, ...
2 http://www.imdb.com/title/tt0077651/ Halloween (1978) - IMDb Halloween II. A Nightmare on Elm Street. Hallo...
3 http://www.halloween.com/ Halloween 2014 | Halloween.com Halloween fun on the internet, the one source ...
4 http://halloweenmovies.com/ HalloweenMovies™ Tickets now on sale for Halloween - ODEON Cine...
5 http://www.loc.gov/folklife/halloween.html Halloween: The Fantasy and Folklore of All Hal... Beginnings of Halloween celebration; First ori...
6 http://www.spirithalloween.com/ Halloween Costumes - Childrens & Adult Hallowe... Spirit Halloween - Halloween Stores nationwide...
7 http://www.cdc.gov/family/halloween/ CDC - Halloween Health and Safety - Family Health 5 days ago ... Fall celebrations like Hallowee...
8 http://www.partycity.com/category/halloween+co... Halloween Costumes for Kids & Adults - Costume... Halloween costumes for all ages and sizes. Sho...
9 http://www.overkillsoftware.com/halloween/ PAYDAY HALLOWEEN SPECIAL 2014
10 http://www.popsugar.com/moms/Ultimate-Hallowee... Ultimate Halloween Guide | POPSUGAR Moms Download our Halloween app! ... Felicity to Sc...
11 http://www.yandy.com/Shopping/products/categor... Sexy Halloween Costumes Sexy Halloween costumes up to 75% off! Free sh...
12 http://www.grandinroad.com/halloween-haven/ Halloween Decorations - Halloween Decor - Gran... Shop Grandin Road's collection of outdoor Hall...
13 http://www.amazon.com/Halloween-Donald-Pleasen... Amazon.com: Halloween: Donald Pleasence, Jamie... Halloween stars Jamie Lee Curtis (A Fish Calle...
14 http://www.instructables.com/halloween Halloween Instructables DIY Halloween costumes for adults, kids and pe...
15 http://www.huffingtonpost.com/2014/10/27/hallo... These Are The Most Googled Halloween Costumes ... 1 day ago ... Some states are making a topical...
16 http://www.wilstar.com/holidays/hallown.htm Halloween - The History, Traditions, and Custo... The history of Halloween and its customs start...
17 https://www.etsy.com/browse/halloween Halloween Costumes, Decor & Treats on Etsy Find one-of-a-kind costumes for kids, adults a...
18 https://nrf.com/media/press-releases/record-nu... Record Number of Americans to Buy Halloween Co... Sep 24, 2014 ... More than two-thirds (67.4%) ...
19 http://www.accuweather.com/en/weather-news/eas... Halloween Forecast: Snow, Cold to Blast Northe... 1 day ago ... A shocking blast of cold air and...
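
The manual steps above can be wrapped in a loop that follows nextPage until it is missing or a page limit is reached. This is a sketch under stated assumptions, not part of the original notebook: `collect_items` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API with 25 made-up results so the example runs without a key or network access.

```python
def collect_items(fetch, query, max_pages=3):
    """Gather items from up to max_pages consecutive result pages.

    fetch is any callable that takes a parameter dict and returns the
    decoded JSON response as a dictionary.
    """
    items = []
    start = 1
    for _ in range(max_pages):
        results = fetch({"q": query, "start": start})
        items.extend(results.get("items", []))
        next_page = results.get("queries", {}).get("nextPage")
        if not next_page:
            break  # no further pages reported
        start = next_page[0]["startIndex"]
    return items


def fake_fetch(parameters):
    """Stand-in for the API: serves 25 made-up results, 10 per page."""
    start = parameters["start"]
    last = min(start + 9, 25)
    response = {"items": [{"link": "result-%d" % i} for i in range(start, last + 1)]}
    if last < 25:
        response["queries"] = {"nextPage": [{"startIndex": last + 1}]}
    return response


items = collect_items(fake_fetch, "halloween", max_pages=5)
print(len(items))
```

With the real API, fetch could be a small function that adds key and cx to the parameters, sends the request with requests as in cell In [15], and returns the parsed JSON.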